Predicting movie success has been attempted many times over the years, especially as computing technology has matured. Being able to predict a movie's success could benefit nearly everyone. The people involved in producing the movie (producers, actors, etc.) are the obvious beneficiaries, but there are second- and third-order effects as well: theaters would benefit, and so could the broader economy.
Using different forms of analysis to identify key variables that correlate with successful movies could make the industry as a whole more successful, steering production decisions toward what the data suggests is profitable.
• Download from https://data.world/popculture/imdb-5000-movie-dataset
• This is a very popular dataset for data analytics projects at major universities. You can find the data dictionary below.
• More information about the dataset is available here.
Variable Name - Description
import pandas as pd
import numpy as np
from scipy import stats
from pandas import plotting  # pandas.tools.plotting was removed in newer pandas versions
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#load the data
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
df = pd.read_csv("data/movie_metadata.csv")
df.head()
# check data types
df.info()
This shows that every column is an object, an integer, or a float. It also shows that there is a lot of missing data.
# create a column that categorizes imdb_score from 1-4 based on its rating
# (using .loc avoids unreliable chained assignment)
df['category'] = 1
df.loc[(df['imdb_score'] > 4) & (df['imdb_score'] <= 6), 'category'] = 2
df.loc[(df['imdb_score'] > 6) & (df['imdb_score'] <= 8), 'category'] = 3
df.loc[df['imdb_score'] > 8, 'category'] = 4
df.head()
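The same bucketing can also be expressed in a single step with `pd.cut`. Here is a minimal sketch on a few hypothetical scores (not the movie data itself):

```python
import pandas as pd

# Hypothetical imdb-style scores spanning all four buckets
scores = pd.Series([3.5, 4.0, 5.9, 6.0, 7.2, 8.0, 9.1])

# Bins are right-inclusive by default: (0,4], (4,6], (6,8], (8,10]
category = pd.cut(scores, bins=[0, 4, 6, 8, 10], labels=[1, 2, 3, 4]).astype(int)
print(category.tolist())  # [1, 1, 2, 2, 3, 3, 4]
```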
# remove any unnecessary columns
df = df.drop(['color', 'director_name', 'movie_imdb_link', 'language', 'country', 'movie_title', 'title_year',
              'aspect_ratio', 'actor_1_name', 'actor_2_name', 'actor_3_name', 'genres', 'plot_keywords'], axis=1)
df.head(1)
df['content_rating'].value_counts()
There are currently 18 different rating categories. Reducing them to a handful will make this column much more usable.
#replace the content ratings in order of target audience
#(1 = children's movies, 4 = mature-audience movies, 5 = unrated/legacy ratings)
rating_map = {
    'G': 1, 'TV-G': 1, 'TV-Y': 1, 'TV-Y7': 1,
    'PG': 2, 'TV-PG': 2, 'GP': 2,
    'PG-13': 3, 'TV-14': 3,
    'R': 4, 'TV-MA': 4, 'M': 4, 'NC-17': 4, 'X': 4,
    'Unrated': 5, 'Not Rated': 5, 'Approved': 5, 'Passed': 5,
}
df['content_rating'] = df['content_rating'].replace(rating_map)
df['content_rating'].value_counts()
Now there are only 5 content rating categories, ordered from children's movies to adult and unrated movies.
# check data for any missing values
df.isnull().sum()
There are 16 columns in the dataset. Of those 16, only 5 (about 31%) have no missing data.
df = df.dropna().reset_index(drop=True)
df.info()
df.describe()
Here is a snapshot of the numerical columns providing some potentially useful initial data. For example, the average duration of a movie in this dataset is 110 minutes.
# (personal preference) move imdb_score column to the first position
front = df['imdb_score']
df.drop(labels=['imdb_score'], axis=1,inplace = True)
df.insert(0, 'imdb_score', front)
df.head(1)
# correlation plot
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), vmax=.8, square=True, annot=True, fmt=".2f")
Reading down the left column, we can see that imdb_score has the highest correlation with num_voted_users, duration, num_critic_for_reviews, and num_user_for_reviews. We'll see if this shows up in the analysis later.
# correlation
corr = pd.DataFrame(df.corr()['imdb_score'].drop('imdb_score'))
corr.sort_values(['imdb_score'], ascending = False)
Here is another view that may be easier to read (shows same info as first column in table but sorted).
• Build regression models using different regression algorithms. The Y value is imdb_score. It is important you use important features in your models.
• Evaluate the models
• (Optional for extra points) Scrape some new data (scoring dataset) from the website and deploy your best model and predict imdb_score for the movies in the scoring dataset).
#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
#lasso regression
from sklearn import linear_model
#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.formula.api import ols
# drop category column so there aren't two columns telling us basically the same thing
df_reg = df.drop(['category'], axis=1)
df_reg.head(1)
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X = df_reg.drop(['imdb_score'], axis=1)
model = lm.LinearRegression()
model.fit(X, y)
model_y = model.predict(X)
coef = ["%.3f" % i for i in model.coef_]
xcolumns = list(X.columns)
print(list(zip(xcolumns, coef)))  # zip is lazy in Python 3, so materialize it before printing
print("mean square error: ", mean_squared_error(y, model_y))
print("variance or r-squared: ", explained_variance_score(y, model_y))
Here we see that a simple linear regression model with all variables only has an r-squared of 33.5%. That isn't very good.
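One caveat worth keeping in mind: the r-squared above is computed on the same rows the model was fit on, so it is an optimistic estimate. A hedged sketch of the usual fix, evaluating on a held-out split (synthetic stand-in features are used here for illustration, not the movie columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 3)  # synthetic features standing in for the movie columns
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Hold out 30% of the rows, fit on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Held-out r-squared is the honest estimate of generalization
print(round(r2_score(y_test, model.predict(X_test)), 3))
```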
val_reg_model = ols("imdb_score~num_critic_for_reviews+duration+director_facebook_likes+actor_3_facebook_likes \
+actor_1_facebook_likes+gross+num_voted_users+cast_total_facebook_likes+facenumber_in_poster \
+num_user_for_reviews+content_rating+budget+actor_2_facebook_likes+movie_facebook_likes",df_reg)
val_reg = val_reg_model.fit()
print(val_reg.summary())
val_reg.mse_resid
Here we see the same results but in a nice table.
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X2 = df_reg[['num_voted_users', 'budget', 'duration', 'num_user_for_reviews', 'gross']]
model2 = lm.LinearRegression()
model2.fit(X2, y)
model2_y = model2.predict(X2)
print("mean square error: ", mean_squared_error(y, model2_y))
print("variance or r-squared: ", explained_variance_score(y, model2_y))
Running the same model with only 5 variables yields a lower r-squared, but not dramatically lower considering we are using fewer than half as many variables.
#Fit the lasso model
y = df_reg['imdb_score']
X = df_reg.drop(['imdb_score'], axis=1)
lasso_model = lm.Lasso(alpha=0.1) #higher alpha (penalty parameter), fewer predictors
lasso_model.fit(X, y)
lasso_model_y = lasso_model.predict(X)
coef = ["%.3f" % i for i in lasso_model.coef_]
xcolumns = list(X.columns)
a = list(zip(xcolumns, coef))  # materialize zip for Python 3
df_coef = pd.DataFrame(a)
df_coef.sort_values(1, ascending=False)
print("mean square error: ", mean_squared_error(y, lasso_model_y))
print("variance or r-squared: ", explained_variance_score(y, lasso_model_y))
Running all of the variables through a lasso model did not give any better results. Here we have an r-squared of 32.6%.
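One possible reason lasso underperforms here is scale: the L1 penalty shrinks all coefficients equally, so features measured in dollars (budget, gross) and features measured in minutes (duration) are penalized very unevenly. A sketch of the standard remedy, standardizing inside a pipeline (with made-up stand-in features, not the real movie columns):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two informative features on wildly different scales (like budget vs duration)
budget = rng.rand(300) * 1e8
duration = rng.rand(300) * 60 + 90
X = np.column_stack([budget, duration])
y = 1e-8 * budget + 0.05 * duration + rng.normal(scale=0.5, size=300)

# Standardizing first puts both features on an equal footing for the L1 penalty
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X, y)
print(pipe.named_steps['lasso'].coef_)  # both coefficients survive the penalty
```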
y = df_reg['imdb_score']
X = df_reg[['duration', 'num_critic_for_reviews', 'num_user_for_reviews', 'gross']]
lasso_model2 = lm.Lasso(alpha=0.1) #higher alpha (penalty parameter), fewer predictors
lasso_model2.fit(X, y)
lasso_model2_y = lasso_model2.predict(X)
coef = ["%.3f" % i for i in lasso_model2.coef_]
xcolumns = list(X.columns)
a = list(zip(xcolumns, coef))  # materialize zip for Python 3
df_coef = pd.DataFrame(a)
df_coef.sort_values(1, ascending=False)
print("mean square error: ", mean_squared_error(y, lasso_model2_y))
print("variance or r-squared: ", explained_variance_score(y, lasso_model2_y))
We can see that reducing the number of variables made the model worse.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X = df_reg.drop(['imdb_score'], axis=1)
regr = RandomForestRegressor(random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
As expected, the random forest model gives an r-squared of 91.8%, which is much higher. Keep in mind, though, that this is computed on the same data the model was trained on, so it is an optimistic estimate.
sorted(zip(regr.feature_importances_, X.columns))
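Because a random forest can nearly memorize its training data, the in-sample figure above is best checked with cross-validation. A small illustration on synthetic data showing how the in-sample score exceeds the cross-validated one:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 4)  # 4 synthetic features, only 2 informative
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

regr = RandomForestRegressor(n_estimators=100, random_state=0)

# In-sample fit looks near-perfect; cross-validated r-squared is lower and more honest
in_sample = regr.fit(X, y).score(X, y)
cv_scores = cross_val_score(regr, X, y, scoring='r2', cv=5)
print(round(in_sample, 3), round(cv_scores.mean(), 3))
```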
from yellowbrick.regressor import ResidualsPlot
y = df_reg['imdb_score']
X = df_reg.drop(['imdb_score'], axis=1)
# Instantiate the linear model and visualizer
regre = lm.LinearRegression()
visualizer = ResidualsPlot(regre)
visualizer.fit(X, y) # Fit the training data to the model
visualizer.show()    # draw the plot (show() replaced poof() in yellowbrick 1.0)
This residual plot is for our first model (r-squared of 33.5%). We can see that the majority of the movies fall in the 6-7 imdb score range.
from yellowbrick.regressor import ResidualsPlot
y = df_reg['imdb_score']
X2 = df_reg[['num_voted_users', 'budget', 'duration', 'num_user_for_reviews', 'gross']]
# Instantiate the linear model and visualizer
regre = lm.LinearRegression()
visualizer = ResidualsPlot(regre)
visualizer.fit(X2, y) # Fit the training data to the model
visualizer.show()    # draw the plot (show() replaced poof() in yellowbrick 1.0)
The same regression method with fewer variables. The plot has essentially the same shape.
from yellowbrick.regressor import ResidualsPlot
y = df_reg['imdb_score']
X = df_reg.drop(['imdb_score'], axis=1)
# Instantiate the linear model and visualizer
regre = RandomForestRegressor(random_state=0)
visualizer = ResidualsPlot(regre)
visualizer.fit(X, y) # Fit the training data to the model
visualizer.show()    # draw the plot (show() replaced poof() in yellowbrick 1.0)
Random forest model using all variables. You can see the residuals are concentrated much more tightly than in the previous models, meaning there is less error.
from yellowbrick.regressor import ResidualsPlot
y = df_reg['imdb_score']
X2 = df_reg[['num_voted_users', 'budget', 'duration', 'num_user_for_reviews', 'gross']]
# Instantiate the linear model and visualizer
regre = RandomForestRegressor(random_state=0)
visualizer = ResidualsPlot(regre)
visualizer.fit(X2, y)
visualizer.show()
And finally, the random forest model with only 5 variables. It has the same shape as the previous random forest model and the r-squared only went down by 0.7% to 91%.
• The goal is to build a classification model to predict if a movie is good or bad. You need to create a new “categorical” column from imdb_score in order to build classification models. Create the column by “binning” the imdb_score into 4 categories (or buckets): less than 4, 4~6, 6~8, and 8~10, which represent bad, OK, good, and excellent respectively.
• It is important that you use different classification algorithms we have learned and evaluate model quality.
• (Optional for extra points) Deploy your best classification model and predict if each movie (in the scoring dataset) is bad, OK, good or excellent.
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize
#for validating your classification model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
df.head(1)
df.info()
df_class = df.astype(int)
df_class.info()
We converted to integers for the classification models.
df_class.groupby(['category']).size()
We can see here that there are:
y = df_class['category']
X = df_class.drop(['imdb_score', 'category'], axis=1)
# split validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Initialize DecisionTreeClassifier()
dt = DecisionTreeClassifier()
# Train a decision tree model
dt.fit(X_train, y_train)
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))
Here we can see the breakdown of the training and test sets.
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
Here we see:
This means that:
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()
This chart shows the same information; it is just a little easier to read.
# Decision tree display
tree.export_graphviz(dt, out_file='data/moviedecisiontree.dot', feature_names=X.columns)
from IPython.display import Image
Image("data/moviedecisiontree.png")
# This is a "full-grown" tree
Here is a display of the decision tree that was generated. Obviously it is impossible to read like this, but you can see just how big it is.
from IPython.display import IFrame
IFrame('data/moviedecisiontree.png', width=1000, height=1000)
This visual is a little more user-friendly.
#declare X variables and y variable
y = df_class['category']
X = df_class.drop(['imdb_score', 'category'], axis=1)
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# initialize KNeighborsClassifier() and train a KNN Model
#knn = KNeighborsClassifier()
knn = KNeighborsClassifier(n_neighbors=3) # n_neighbors defaults to 5 if left unspecified
knn.fit(X_train, y_train)
#Model evaluation without validation
# Find out the performance of this model & interpret the results
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
#print(metrics.roc_auc_score(y_test, knn.predict(X_test)))
I'm not going to break this one down by score category, but we can see that this model performs worse, at 58.2% accuracy.
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=knn.predict(X_test))
plt.show()
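KNN is a distance-based method, so unscaled features such as budget and gross can dominate the Euclidean distance and drown out everything else, which may explain the weak score above. A sketch of the effect on synthetic data (one informative small-scale feature, one irrelevant large-scale one):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# One informative small-scale feature, one noisy feature on a budget-like scale
informative = rng.rand(400)
noise = rng.rand(400) * 1e8
X = np.column_stack([informative, noise])
y = (informative > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Raw distances are dominated by the large-scale noise column
raw = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test)
# Standardizing first lets the informative feature matter again
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(
    X_train, y_train).score(X_test, y_test)
print(round(raw, 3), round(scaled, 3))
```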
#declare X variables and y variable
y = df_class['category']
X = df_class.drop(['imdb_score', 'category'], axis=1)
# evaluate the model by splitting into train and test sets and build a logistic regression model
# name it as "lr"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LogisticRegression(multi_class='multinomial', solver ='newton-cg', max_iter=1000)
lr.fit(X_train, y_train)
#Model evaluation
# Find out the performance of this model
print(metrics.accuracy_score(y_test, lr.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, lr.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, lr.predict(X_test)))
We can see that this model was slightly less accurate, at 68.7%. Looking at where the differences from the decision tree model were:
This means that:
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=lr.predict(X_test))
plt.show()
# evaluate the logit model using 10-fold cross-validation
scores = cross_val_score(lr, X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
Note: this took a long time to run since I increased the maximum number of iterations to 1000 (the default is 100).
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=20) #building 20 decision trees
clf=clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# generate evaluation metrics
print(metrics.accuracy_score(y_test, clf.predict(X_test))) #overall accuracy
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Again, as expected, the random forest had an accuracy of 77.7%, which is much better than the previous models. Notice that all of the models had trouble classifying the bad movies and did best at classifying the good movies.
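One common mitigation for the trouble with the rare "bad" class is class weighting: most sklearn classifiers (including RandomForestClassifier) accept class_weight='balanced'. The sketch below uses logistic regression on synthetic imbalanced data, since the recall effect is easiest to see there:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# ~15% minority class (label 1) behind a noisy linear boundary
X = rng.normal(size=(3000, 3))
y = (X[:, 0] + 0.8 * rng.normal(size=3000) > 1.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

plain = LogisticRegression().fit(X_train, y_train)
balanced = LogisticRegression(class_weight='balanced').fit(X_train, y_train)

# Balanced weighting trades some overall accuracy for better minority-class recall
r_plain = recall_score(y_test, plain.predict(X_test))
r_bal = recall_score(y_test, balanced.predict(X_test))
print(round(r_plain, 2), round(r_bal, 2))
```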
y = df['category']
X = df.drop(['imdb_score', 'category'], axis=1)
X_new = SelectKBest(chi2, k=3).fit_transform(X, y)
print(X_new)
selector = SelectKBest(chi2, k=3).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
X.head(1)
Looking at the variables, we can see that gross, num_voted_users, and budget were the top 3.
# evaluate the model by splitting into train (70%) and test sets (30%)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)
#Model evaluation
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
And here we see that after using feature selection, a decision tree classifier gives us an accuracy of 64.4%.
y = df['category']
X = df.drop(['imdb_score', 'category'], axis=1)
# build ExtraTreesClassifier
model_extra = ExtraTreesClassifier()
model_extra.fit(X, y)
model_extra.score(X, y)
# display the relative importance of each attribute
print(model_extra.feature_importances_)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), model_extra.feature_importances_), X.columns)))
This shows that according to the ExtraTreesClassifier, num_voted_users, duration, and budget are the most important variables.
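RFE was imported in the feature-selection block above but never exercised. A minimal sketch of how it recursively eliminates the weakest features, on synthetic data where only two of five columns matter:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
# Only features 0 and 2 drive the label; the rest are noise
y = ((X[:, 0] + X[:, 2]) > 1.0).astype(int)

# Repeatedly fit the estimator and drop the feature with the smallest coefficient
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the surviving features
```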
• Analyze the data using K-means algorithm and Hierarchical clustering algorithm. You determine the optimal K value for K-means. This is exploratory data analysis and you need to report the movie “profiles” based on clustering analysis.
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
#for validating your classification model
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.spatial.distance import cdist
df.head(1)
df_clus = df.drop(['category'], axis=1)
# variance test
df_clus.var()
#normalize each column and print the first 5 rows
df_norm = (df_clus - df_clus.mean()) / (df_clus.max() - df_clus.min())
df_norm.head()
df_norm.var()
y = df_norm['imdb_score']
X = df_norm.drop(['imdb_score'], axis=1)
#The Elbow method
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
Using the elbow method, I would probably choose to use k=3 since the line levels out a little more between 3 and 4.
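The silhouette score is a useful complement to the elbow method, since it peaks at a good k rather than merely bending. A sketch on synthetic blobs with three planted clusters (made-up data, not the movie features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated planted clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, peaks at the true k
    print(k, round(scores[k], 3))
```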
k_means = KMeans(init='k-means++', n_clusters=3, random_state=0)
k_means.fit(df_norm)
k_means.cluster_centers_
# add cluster label into the dataset as a column
df1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df1.head()
df_clus = df_clus.reset_index(drop=True)
df1 = df1.reset_index(drop=True)
df2 = df_clus.join(df1)
df2.tail()
Displaying this allows us to verify that merging the tables did not create any null values. Now the cluster column is at the end of the table.
df2.groupby('cluster').mean()
This allows us to build profiles based on mean values for each cluster. We will analyze these further in the summary at the end, but briefly:
Cluster 1:
Cluster 2:
Cluster 3:
# Display cluster sizes
df2.groupby(['cluster']).size()
We can see that cluster 0 is about one-third the size of clusters 1 and 2.
df_clus2 = df.drop(['imdb_score'], axis=1)
#normalize each column and print the first 5 rows
df_norm2 = (df_clus2 - df_clus2.mean()) / (df_clus2.max() - df_clus2.min())
df_norm2.head()
y = df_norm2['category']
X = df_norm2.drop(['category'], axis=1)
k_means = KMeans(init='k-means++', n_clusters=3, random_state=0)
k_means.fit(df_norm2)
# add cluster label into the dataset as a column
df1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df_clus2 = df_clus2.reset_index(drop=True)
df1 = df1.reset_index(drop=True)
df3 = df_clus2.join(df1)
df3.tail(1)
df3.groupby('cluster').mean()
# Display cluster sizes
df3.groupby(['cluster']).size()
We can see that the dispersion across these clusters is a little more even.
Some cluster differences worth noting (all values will be mean values for that variable and cluster):
Cluster 1:
Cluster 2:
Cluster 3:
np.random.seed(1) # setting random seed to get the same results each time.
agg= AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)
agg.labels_
plt.figure(figsize=(16,10))
linkage_matrix = ward(X)
dendrogram(linkage_matrix, orientation="left")
plt.tight_layout() # fixes margins
Looking at this model, it appears that it may be best to use somewhere between 3 and 5 clusters.
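A ward linkage matrix like the one above can be cut into flat cluster labels at any desired count with scipy's `fcluster`, which makes it easy to profile the clusters the same way as with K-means. A sketch on synthetic, well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward

rng = np.random.RandomState(0)
# Three tight synthetic groups standing in for the normalized movie features
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

linkage_matrix = ward(X)
# 'maxclust' cuts the tree so that exactly 3 flat clusters remain
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(np.unique(labels))  # -> [1 2 3]
```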
# add cluster label into the dataset as a column
df1 = pd.DataFrame(agg.labels_, columns = ['cluster'])
df1.head()
df2 = df_clus.join(df1)
df2.head()
df2.groupby('cluster').mean()
We can see that the dispersion across these clusters is a little more even.
Some cluster differences worth noting (all values will be mean values for that variable and cluster):
Cluster 1:
Cluster 2:
Cluster 3:
df2.groupby('cluster').size()
We can see that Agglomerative Clustering split two of the clusters fairly evenly, while the other is a little less than half the size of the others (fairly similar to K-means).
sns.lmplot(data=df2, x="cluster", y="imdb_score", x_jitter=.15, y_jitter=.15)
#create box plots to visualize content_rating and imdb_score
df2.boxplot('imdb_score', by='cluster', figsize=(12, 8))
We can see there is a noticeable difference in imdb_score among the three clusters.
sns.lmplot(data=df2, x="cluster", y="num_critic_for_reviews", x_jitter=.15, y_jitter=.15)
There is also a significant difference in num_critic_for_reviews between the clusters.
sns.lmplot(data=df2, x="cluster", y="duration", x_jitter=.15, y_jitter=.15)
sns.lmplot(data=df2, x="cluster", y="num_voted_users", x_jitter=.15, y_jitter=.15)
sns.lmplot(data=df2, x="cluster", y="content_rating", x_jitter=.15, y_jitter=.15)
sns.lmplot(data=df2, x="cluster", y="budget", x_jitter=.15, y_jitter=.15)
#create box plots to visualize content_rating and imdb_score
df2.boxplot('imdb_score', by='content_rating', figsize=(12, 8))
df.groupby('category').mean()
Doing a simple groupby with the categories that we created can also provide some very useful information. I think this is very generalized information though because you can see that nearly all means increase as the imdb category increases from 1-4. The only variables that do not display this behavior are num_user_for_reviews, and facenumber_in_poster. This is why the deeper analysis that took place in this project can be more revealing.
• At the end, this is what your client is interested in. Develop useful insights from your models (regression, classification, and clustering). Write a summary using bulleted lists and/or numbers in markdown cells. If this section is “too thin”, your project will receive a low grade.
Here are the results of the clusters that were also summarized above (these are the mean values for each cluster or profile):
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 1:
Cluster 2:
Cluster 3:
Based on the cluster analysis shown above, there is evidence that potentially:
The random forest classifier is by far the most accurate to use for classification. I was able to achieve around a 74% accuracy with it. All of the classifiers seem to have trouble differentiating the bad movies though.
Random Forest was also significantly better than the other methods for regression analysis. Using all 14 variables that I narrowed down to, I was able to get a 92.4% r-squared, which is very good. This shows that it may be very possible to predict how well a movie will be rated based on these variables. Even after dropping down to num_voted_users, budget, duration, num_user_for_reviews, and gross, I still had a 90% r-squared.